In [13]:
%matplotlib inline

Practical introduction to distributional semantic models using PyDSM

You can find this notebook at bit.ly/pydsmmeetup.

This presentation will guide you through a short introduction to working with distributional semantic models (DSMs) using the Python package PyDSM.

To successfully install the package, you first need the following installed locally:

  • Python 3.4
  • Cython
  • Scipy

I recommend using the scientific Python distribution Anaconda to ease the process.

Installation

When you have installed the requirements, this should do the trick:

git clone https://github.com/jimmycallin/pydsm
cd pydsm
python setup.py install

Make sure everything is installed correctly by opening your favorite REPL (which should be IPython) and importing the library.


In [2]:
import pydsm

Outline

  • Overview of PyDSM
  • Common DSM operations
  • First steps towards a 100% correct synonym test

PyDSM is a small Python library made for exploratory analysis of distributional semantic models. So far, two common models are available: CooccurrenceDSM and RandomIndexing. RandomIndexing is an implemented version of the model we use at Gavagai. For a detailed explanation, I recommend reading this introduction.

Building a common collocation matrix

CooccurrenceDSM requires a window_size and a corpus. We will use a small preprocessed portion of the ukWaC corpus, which is available here. Let's build the DSM.


In [4]:
dsm = pydsm.build(model=pydsm.CooccurrenceDSM, 
                  corpus='ukwac.100k.clean.txt',
                  window_size=(2,2),
                  min_frequency=50)


Building matrix from corpus with config: {'min_frequency': 50, 'window_size': (2, 2)}. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Total time of build: 1050.10 sec

We removed all words having a total occurrence frequency below 50 to keep things clean.
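
To make concrete what window_size=(2,2) means (count co-occurrences within two context words to the left and two to the right of each focus word), here is a minimal sketch of symmetric window counting on a toy sentence. This is only an illustration of the idea, not PyDSM's actual implementation:

from collections import Counter, defaultdict

def count_cooccurrences(tokens, window_size=(2, 2)):
    """Count how often each context word occurs within the window around each focus word."""
    left, right = window_size
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        context = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
        counts[word].update(context)
    return counts

tokens = "never gonna give you up never gonna let you down".split()
print(count_cooccurrences(tokens)['you'])
# 'you' co-occurs twice with 'gonna' and once each with 'give', 'up', 'never', 'let', 'down'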


In [28]:
# Display a few word vectors, restricted to a handful of context words
dsm[['never', 'gonna', 'give', 'you', 'up'],['never', 'gonna', 'let', 'you', 'down']]


Out[28]:
[5, 5]  never    gonna   let      you       down
never   446.00   34.00   268.00   2750.00   81.00
gonna   34.00    0.00    9.00     206.00    4.00
give    147.00   22.00   196.00   5669.00   25.00
you     2750.00  206.00  1997.00  17250.00  1420.00
up      325.00   15.00   170.00   4016.00   1656.00

Common operations

  • Apply weighting functions, found in pydsm.weighting.

In [4]:
ppmi = dsm.apply_weighting(pydsm.weighting.ppmi)


Total time of apply_weighting: 10.66 sec

  • Look for nearest neighbors.

In [5]:
ppmi.nearest_neighbors('blue')


Total time of nearest_neighbors: 3.81 sec
Out[5]:
[380405, 1]  blue
blue         1.00
red          0.18
yellow       0.18
black        0.17
white        0.17
grey         0.16
winged       0.16
throated     0.15
tailed       0.15
green        0.15
...          ...

Compose distributional vectors

Compositional distributional semantics is a whole field devoted to finding methods of composing distributional vectors to extract compositional meaning. This generally doesn't work well, and especially not with this little data.


In [6]:
car_brand = dsm.compose('car', 'brand', comp_func=pydsm.composition.multiplicative)
car_brand


Out[6]:
[1, 380405]  iskreno  9ql   bewhildered  componenttype  ranjanbaa  ruprecht  privilaged  ...
car brand    0.00     0.00  0.00         0.00           0.00       0.00      0.00        ...

In a perfect world, its neighbors would contain a lot more car brands, rather than a mixture of each word's neighbors. Still, addition of distributional vectors has been shown to perform best in many machine learning tasks when using the composed vector as input.


In [7]:
ppmi.nearest_neighbors(car_brand)


Total time of nearest_neighbors: 3.49 sec
Out[7]:
[380405, 1]            car brand
cynick                 0.76
basinful               0.76
polysemantism          0.76
photogrph              0.76
intellection           0.72
deprofessionalization  0.72
crosssection           0.72
unsuitableness         0.72
spandrel               0.70
elegent                0.70
...                    ...
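
For reference, the two composition functions discussed here usually boil down to element-wise operations on the word vectors. Below is a minimal numpy sketch of the usual definitions; PyDSM ships its own versions in pydsm.composition, and the tiny vectors are made up for illustration:

import numpy as np

# Toy context vectors; real vectors would have hundreds of thousands of dimensions.
car = np.array([0.2, 0.0, 0.7, 0.1])
brand = np.array([0.5, 0.3, 0.0, 0.4])

multiplicative = car * brand   # element-wise product: only contexts shared by both words survive
additive = car + brand         # element-wise sum: the union of both words' contexts

print(multiplicative)          # 0.1, 0.0, 0.0, 0.04
print(additive)                # 0.7, 0.3, 0.7, 0.5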

Evaluating the DSM

We will evaluate the quality of our DSM with a commonly used synonym test, TOEFL. This test contains 80 questions of the type "what is the best synonym for a word, given a set of candidate words".

hasten: accelerate, accompany, determine, permit
tranquillity: harshness, peacefulness, weariness, happiness
make: print, trade, borrow, earn

This corresponds to measuring the distance to all candidates and picking the word that is closest to the problem word.
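
As a rough sketch of what that evaluation does internally: for each problem word, compare it to every candidate and pick the most similar one. The helper below and its toy vectors are made up for illustration; the real implementation is pydsm.evaluation.toefl.

import numpy as np

def answer_toefl_item(vectors, problem_word, candidates):
    """Return the candidate whose vector has the highest cosine similarity to the problem word."""
    def cosine(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return a @ b / denom if denom else 0.0

    target = vectors[problem_word]
    known = [c for c in candidates if c in vectors]
    return max(known, key=lambda c: cosine(target, vectors[c])) if known else None

# Toy 2-dimensional vectors, just to show the mechanics:
toy = {'hasten':     np.array([1.0, 0.2]),
       'accelerate': np.array([0.9, 0.3]),
       'accompany':  np.array([0.1, 1.0]),
       'determine':  np.array([0.2, 0.8]),
       'permit':     np.array([0.3, 0.6])}
print(answer_toefl_item(toy, 'hasten', ['accelerate', 'accompany', 'determine', 'permit']))
# -> accelerate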


In [8]:
dsm.evaluate(pydsm.evaluation.toefl)


Evaluation report
Accuracy: 0.525
Number of words: 80
Correct words: ['levied', 'sufficient', 'distribute', 'expeditious', 'flawed', 'perilous', 'peculiarly', 'roots', 'unlikely', 'recognized', 'hailed', 'generally', 'narrow', 'concocted', 'fashion', 'showed', 'costly', 'constantly', 'marketed', 'colloquial', 'issues', 'essentially', 'hind', 'haphazardly', 'annals', 'situated', 'slowly', 'provisions', 'arranged', 'urgently', 'wildly', 'infinite', 'prospective', 'verbally', 'showy', 'grin', 'percentage', 'primarily', 'hastily', 'bigger', 'prominent', 'built']
Incorrect words: ['command', 'physician', 'spot', 'debate', 'deftly', 'hue', 'furnish', 'make', 'perseverance', 'prolific', 'easygoing', 'tasks', 'consumed', 'uniform', 'normally', 'sustained', 'discrepancies', 'principal', 'highlight', 'terminated', 'solitary', 'unmatched', 'enormously', 'salutes', 'advent', 'dissipate', 'concisely', 'keen', 'zenith', 'tranquillity', 'figure', 'resolved', 'hasten', 'temperate', 'fanciful', 'often', 'feasible']
Unknown words: ['halfheartedly']
Unknown correct synonym: []
Total time of evaluate: 0.98 sec
Out[8]:
0.525

That's not even close to 100%, and there are several reasons for this. Most notably, simple co-occurrence counting will lead to distributional vectors skewed towards ubiquitous words such as function words. We need a weighting function that decreases the importance of commonly occurring words and increases the importance of words that occur in specific contexts. This is what Positive Pointwise Mutual Information (PPMI) does.
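
To make concrete what PPMI does, here is a minimal numpy sketch of the weighting applied to a raw count matrix. PyDSM's own implementation is pydsm.weighting.ppmi; the version below is only for illustration and skips any smoothing.

import numpy as np

def ppmi(counts):
    """Positive PMI: log p(w,c) / (p(w) p(c)), with negative values clipped to zero."""
    total = counts.sum()
    p_word = counts.sum(axis=1, keepdims=True) / total      # p(w), row marginals
    p_context = counts.sum(axis=0, keepdims=True) / total   # p(c), column marginals
    p_joint = counts / total                                 # p(w, c)
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log(p_joint / (p_word * p_context))
    pmi[~np.isfinite(pmi)] = 0.0    # zero counts give -inf/nan; treat as no association
    return np.maximum(pmi, 0.0)

# A frequent, uninformative context (the 'the'-like first column) gets weighted down,
# while a rare but specific co-occurrence stands out.
counts = np.array([[50., 2., 0.],
                   [45., 0., 6.],
                   [55., 1., 0.]])
print(ppmi(counts).round(2))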

Let's weight the matrix


In [15]:
dsm.visualize(vis_func=pydsm.visualization.heatmap)



In [14]:
ppmi = dsm.apply_weighting(pydsm.weighting.ppmi)
ppmi.visualize(vis_func=pydsm.visualization.heatmap)


Total time of apply_weighting: 46.60 sec

Visualizing them as heatmaps, we can see how the vector values are more evenly distributed after applying PPMI.

See Bullinaria & Levy "Extracting Semantic Representations from Word Co-occurrence Statistics: A Computational Study" (2007) for details.
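
If you want to reproduce this kind of side-by-side heatmap without PyDSM's visualize helper, a plain matplotlib sketch looks roughly like this. The toy matrices below are stand-ins; with PyDSM you would plot blocks of dsm and ppmi instead.

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
raw = rng.poisson(lam=2.0, size=(40, 40)) * rng.integers(1, 50, size=(40, 40))
weighted = np.log1p(raw)   # a crude stand-in for a weighting step

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, (matrix, title) in zip(axes, [(raw, 'raw counts'), (weighted, 'weighted')]):
    im = ax.imshow(matrix, aspect='auto', interpolation='nearest')
    ax.set_title(title)
    fig.colorbar(im, ax=ax)
plt.show()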


In [11]:
ppmi.evaluate(pydsm.evaluation.toefl)


Evaluation report
Accuracy: 0.825
Number of words: 80
Correct words: ['levied', 'sufficient', 'distribute', 'expeditious', 'spot', 'flawed', 'perilous', 'peculiarly', 'debate', 'deftly', 'hue', 'roots', 'unlikely', 'recognized', 'make', 'hailed', 'prolific', 'easygoing', 'generally', 'narrow', 'concocted', 'fashion', 'consumed', 'showed', 'costly', 'uniform', 'normally', 'sustained', 'discrepancies', 'constantly', 'marketed', 'principal', 'highlight', 'colloquial', 'issues', 'essentially', 'hind', 'haphazardly', 'annals', 'situated', 'enormously', 'salutes', 'slowly', 'arranged', 'advent', 'dissipate', 'urgently', 'infinite', 'prospective', 'verbally', 'showy', 'grin', 'percentage', 'keen', 'zenith', 'primarily', 'tranquillity', 'hastily', 'bigger', 'prominent', 'resolved', 'hasten', 'fanciful', 'often', 'built', 'feasible']
Incorrect words: ['command', 'physician', 'furnish', 'perseverance', 'tasks', 'terminated', 'solitary', 'unmatched', 'provisions', 'wildly', 'concisely', 'figure', 'temperate']
Unknown words: ['halfheartedly']
Unknown correct synonym: []
Total time of evaluate: 1.64 sec
Out[11]:
0.825

While 82.5% is a significant improvement, we still have a long way to go. B&L (2012) [1] succeeded in getting 100% on the TOEFL test. We will be using some of their techniques to improve our own score, but since our corpus is about a 50th of the size of theirs, we won't see the same gains. Still, the technique is interesting.

Singular Value Decomposition

Singular Value Decomposition (SVD) has long been known to improve the quality of the model, and it is part of the LSA model. Traditionally, the first couple of hundred or thousand principal components have been kept, based on the belief that the remaining ones mostly contain noise. This does indeed increase performance. What B&L found was that by also removing the first 100-200 principal components, the results improved even further.

[1] Bullinaria, J.A. & Levy, J.P. Extracting Semantic Representations from Word Co-occurrence Statistics: Stop-lists, Stemming and SVD. (2012)
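
To make the recipe concrete, here is roughly what it looks like with SciPy's truncated SVD on a plain matrix. The matrix and component counts (k=300 kept, first 100 dropped) are arbitrary illustrations; in the cells that follow we use PyDSM's own svd method on the real PPMI matrix.

import numpy as np
from scipy.sparse.linalg import svds

weighted = np.random.random((1000, 500))   # stand-in for a PPMI-weighted matrix

# Keep the 300 largest singular triplets (truncated SVD).
u, s, v_t = svds(weighted, k=300)
order = np.argsort(s)[::-1]                # sort descending; svds gives no particular order
u, s = u[:, order], s[order]

reduced = u * s                            # word vectors in the reduced space
trimmed = reduced[:, 100:]                 # additionally drop the first (largest) components
print(trimmed.shape)                       # (1000, 200)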


In [14]:
# Sort the matrix rows and columns by their summed counts in descending order.
# This is basically equivalent to sorting by word frequency.
dsm.matrix = dsm.matrix.sort(key='sum', axis=1, ascending=False).sort(key='sum', axis=0, ascending=False)
# Keep only the 50k most frequent rows and columns to make the SVD tractable.
dsm.matrix = dsm.matrix[:50000,:50000]
# Apply PPMI weighting to the reduced matrix
ppmi = dsm.apply_weighting(pydsm.weighting.ppmi)


Total time of apply_weighting: 6.88 sec

In [7]:
# Perform SVD picking out the 3000 largest principal components. This is slow (hours) and requires a lot of memory.
u, s, v_t = ppmi.matrix.svd(k=3000)
dim_reduced = u.multiply(s)
dim_reduced


Out[7]:
[53000, 3000]  2999   2998    2997   2996    2995    2994   2993    ...
the            37.79  -13.88  3.74   -25.55  -2.25   2.54   -5.42   ...
of             43.12  -20.53  0.07   -31.56  4.85    5.38   4.02    ...
and            84.14  -23.96  14.96  -77.18  -11.15  6.06   -11.65  ...
to             25.51  -12.69  3.79   -15.56  -3.79   17.18  -14.11  ...
a              44.72  -21.55  6.81   -24.97  -11.62  -7.06  -10.99  ...
in             41.98  -4.56   -4.61  -32.90  -3.87   12.83  -7.43   ...
is             31.47  -14.21  4.61   -24.45  -4.49   14.61  -8.62   ...
for            37.66  -19.14  -1.60  -27.96  9.65    6.24   -5.81   ...
that           23.66  -9.99   6.60   -16.04  -3.16   17.98  -9.37   ...
on             33.41  -7.51   -0.72  -26.93  -8.44   3.53   -5.98   ...
...            ...    ...     ...    ...     ...     ...    ...     ...

Because of the nature of the SVD factorization (u and v_t are orthogonal, so v_t amounts to nothing more than a rotation of the vector space) and the fact that our distance measure is the cosine distance, we can use u * s directly and discard v_t. See the paper for details.
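
A quick way to convince yourself of this is a toy numpy check (nothing PyDSM-specific): cosine similarities between rows of u * s equal those between rows of the full product u * s * v_t, because the orthogonal v_t only rotates the space.

import numpy as np

rng = np.random.default_rng(0)
m = rng.random((6, 4))

u, s, v_t = np.linalg.svd(m, full_matrices=False)
reduced = u * s            # word vectors with v_t discarded
full = (u * s) @ v_t       # full reconstruction (equal to m up to rounding)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Cosine similarities between rows are unchanged by the rotation v_t:
print(np.isclose(cosine(reduced[0], reduced[1]), cosine(full[0], full[1])))   # True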


In [10]:
pydsm.evaluation.toefl(dim_reduced)


Evaluation report
Accuracy: 0.8
Number of words: 80
Correct words: ['fashion', 'make', 'provisions', 'showed', 'peculiarly', 'feasible', 'issues', 'costly', 'often', 'keen', 'resolved', 'levied', 'advent', 'salutes', 'terminated', 'prominent', 'primarily', 'slowly', 'roots', 'sustained', 'debate', 'bigger', 'expeditious', 'discrepancies', 'percentage', 'prolific', 'normally', 'distribute', 'tasks', 'built', 'unlikely', 'solitary', 'annals', 'generally', 'perilous', 'urgently', 'infinite', 'essentially', 'dissipate', 'prospective', 'highlight', 'situated', 'constantly', 'recognized', 'deftly', 'marketed', 'spot', 'concisely', 'hind', 'wildly', 'colloquial', 'consumed', 'arranged', 'concocted', 'hue', 'sufficient', 'hastily', 'enormously', 'flawed', 'principal', 'hailed', 'showy', 'grin', 'verbally']
Incorrect words: ['temperate', 'narrow', 'physician', 'perseverance', 'furnish', 'command', 'hasten', 'fanciful', 'zenith', 'figure', 'uniform']
Unknown words: ['halfheartedly', 'easygoing', 'haphazardly']
Unknown correct synonym: ['unmatched', 'tranquillity']
Out[10]:
0.8

Running the evaluation on the SVD matrix alone actually worsens the result compared to the original PPMI matrix. This is expected (see the paper for why). But what happens next, when we remove the largest principal components, is interesting.


In [16]:
# Ignoring the largest principal components
pydsm.evaluation.toefl(dim_reduced[:,200:])


Evaluation report
Accuracy: 0.8625
Number of words: 80
Correct words: ['fashion', 'make', 'provisions', 'showed', 'feasible', 'issues', 'costly', 'often', 'keen', 'resolved', 'levied', 'advent', 'salutes', 'terminated', 'prominent', 'primarily', 'slowly', 'roots', 'sustained', 'debate', 'bigger', 'expeditious', 'discrepancies', 'percentage', 'temperate', 'prolific', 'normally', 'distribute', 'tasks', 'built', 'unlikely', 'solitary', 'physician', 'generally', 'perilous', 'urgently', 'infinite', 'perseverance', 'essentially', 'dissipate', 'command', 'prospective', 'highlight', 'situated', 'constantly', 'recognized', 'deftly', 'marketed', 'hasten', 'spot', 'concisely', 'hind', 'wildly', 'colloquial', 'consumed', 'arranged', 'fanciful', 'concocted', 'hue', 'sufficient', 'hastily', 'enormously', 'flawed', 'figure', 'uniform', 'principal', 'hailed', 'grin', 'verbally']
Incorrect words: ['peculiarly', 'narrow', 'annals', 'furnish', 'zenith', 'showy']
Unknown words: ['halfheartedly', 'easygoing', 'haphazardly']
Unknown correct synonym: ['unmatched', 'tranquillity']
Out[16]:
0.8625

Why removing the components that capture the directions of largest variance improves the result is not entirely clear. When B&L add extra random noise to the model, they clearly show that removing the first principal components also removes most of the added noise. The previous assumption that noise mainly gathers in the smallest principal components is simply not true.

And this is as far as we get. Because of some parameter choices to ease the calculation and our smaller corpus, we don't quite reach the same score as B&L. Feel free to try to get all the way to the top. And if you do, I would be very interested to hear about it!

Thanks!